
    Whole genome sequencing of Turkish genomes reveals functional private alleles and impact of genetic interactions with Europe, Asia and Africa

    Background: Turkey has been a crossroads of major population movements throughout history and a hotspot of cultural interactions. Several studies have investigated the complex population history of Turkey through a limited set of genetic markers; however, to date, no study has assessed genetic variation at the whole-genome level using whole genome sequencing. Here, we present whole genome sequences of 16 Turkish individuals resequenced at high coverage (32×-48×). Results: We show that the genetic variation of the contemporary Turkish population clusters with South European populations, as expected, but also shows signatures of relatively recent contributions from ancestral East Asian populations. In addition, we document a significant enrichment of non-synonymous private alleles, consistent with recent observations in European populations. A number of variants associated with skin color and total cholesterol levels show frequency differentiation between the Turkish and European populations. Furthermore, we analyzed the 17q21.31 inversion polymorphism region (MAPT locus) and found an increased allele frequency of 31.25% for the H1/H2 inversion polymorphism, compared with about 25% in European populations. Conclusion: This study provides the first map of common genetic variation from 16 western Asian individuals and thus helps fill an important geographical gap in the analysis of natural human variation and human migration. Our data will help develop population-specific experimental designs for studies investigating disease associations and demographic history in Turkey. © 2014 Alkan et al.
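
    The reported H1/H2 figure follows from a simple haplotype count over the 16 diploid genomes (32 chromosomes in total). The sketch below shows only that arithmetic, with a hypothetical carrier count, and is not part of the study's analysis pipeline.

        # Illustrative arithmetic only: how a 31.25% allele frequency can arise
        # from 16 diploid genomes. The carrier count below is hypothetical.
        n_individuals = 16
        n_chromosomes = 2 * n_individuals          # diploid: 32 haplotypes observed
        h2_haplotypes = 10                         # hypothetical count of H2 inversion haplotypes
        frequency = h2_haplotypes / n_chromosomes
        print(f"H2 allele frequency: {frequency:.2%}")   # 31.25%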

    Faster Approximate String Matching for Short Patterns

    We study the classical approximate string matching problem: given strings $P$ and $Q$ and an error threshold $k$, find all ending positions of substrings of $Q$ whose edit distance to $P$ is at most $k$. Let $P$ and $Q$ have lengths $m$ and $n$, respectively. On a standard unit-cost word RAM with word size $w \geq \log n$ we present an algorithm using time $O(nk \cdot \min(\frac{\log^2 m}{\log n}, \frac{\log^2 m \log w}{w}) + n)$. When $P$ is short, namely $m = 2^{o(\sqrt{\log n})}$ or $m = 2^{o(\sqrt{w/\log w})}$, this improves the previously best known time bounds for the problem. The result is achieved using a novel implementation of the Landau-Vishkin algorithm based on tabulation and word-level parallelism. Comment: To appear in Theory of Computing Systems.
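
    For concreteness, a plain O(nm) dynamic-programming baseline for the k-differences problem is sketched below; it reports the same ending positions the abstract describes, but does not reproduce the paper's tabulated, word-parallel Landau-Vishkin implementation.

        def approximate_matches(P, Q, k):
            """Report all 0-based ending positions in Q of substrings whose edit
            distance to P is at most k. Plain O(nm) dynamic programming; the paper's
            contribution is a much faster tabulated, word-parallel variant of
            Landau-Vishkin, which is not reproduced here."""
            m, n = len(P), len(Q)
            # col[i] = edit distance of P[:i] to the best substring of Q ending at j
            col = list(range(m + 1))
            ends = []
            for j in range(1, n + 1):
                prev_diag = col[0]          # diagonal value from the previous column
                col[0] = 0                  # a match may start at any position of Q
                for i in range(1, m + 1):
                    cur = col[i]
                    cost = 0 if P[i - 1] == Q[j - 1] else 1
                    col[i] = min(col[i] + 1,        # Q[j-1] left unmatched (insertion)
                                 col[i - 1] + 1,    # P[i-1] left unmatched (deletion)
                                 prev_diag + cost)  # match or substitution
                    prev_diag = cur
                if col[m] <= k:
                    ends.append(j - 1)
            return ends

        # 0-based ending positions in Q of substrings within edit distance 2 of P
        print(approximate_matches("survey", "surgery", 2))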

    SCALCE: Boosting sequence compression algorithms using locally consistent encoding

    Motivation: High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated worldwide. Currently, most HTS data are compressed through general purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis, provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a 'boosting' scheme based on the Locally Consistent Parsing technique which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome. Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE-reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly, even the running time of SCALCE + gzip improves on that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, exploiting the reordering SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to a 3.34-fold improvement in compression rate and a 1.26-fold improvement in running time. © The Author 2012. Published by Oxford University Press. All rights reserved.
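
    The reordering principle, grouping reads that share long substrings so that a general-purpose compressor sees similar reads next to each other, can be illustrated with a much simpler stand-in for Locally Consistent Parsing: bucketing reads by a representative k-mer. The sketch below shows only this idea and is not SCALCE's actual LCP-based scheme.

        import gzip
        from collections import defaultdict

        def bucket_key(read, k=12):
            """Pick a representative k-mer for a read (lexicographically smallest here).
            SCALCE derives 'core' substrings via Locally Consistent Parsing; the minimal
            k-mer is only a simple stand-in for illustration."""
            if len(read) < k:
                return read
            return min(read[i:i + k] for i in range(len(read) - k + 1))

        def reorder_reads(reads, k=12):
            """Group reads sharing a representative k-mer so that similar reads become
            adjacent, which tends to help downstream general-purpose compressors."""
            buckets = defaultdict(list)
            for r in reads:
                buckets[bucket_key(r, k)].append(r)
            ordered = []
            for key in sorted(buckets):
                ordered.extend(buckets[key])
            return ordered

        reads = ["ACGTACGTTTGA", "TTGAACGTACGT", "GGGCCCATATAT", "ACGTACGTTTGC"]
        plain = gzip.compress("\n".join(reads).encode())
        reordered = gzip.compress("\n".join(reorder_reads(reads, k=8)).encode())
        print(len(plain), len(reordered))  # reordering pays off on realistic read sets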

    Mirroring trees in the light of their topologies

    Motivation: Determining the interaction partners among protein/domain families poses hard computational problems, in particular in the presence of paralogous proteins. Available approaches aim to identify interaction partners among protein/domain families through maximizing the similarity between trimmed versions of their phylogenetic trees. Since maximization of any natural similarity score is computationally difficult, many approaches employ heuristics to evaluate the distance matrices corresponding to the tree topologies in question. In this article, we devise an efficient deterministic algorithm which directly maximizes the similarity between two leaf labeled trees with edge lengths, obtaining a score-optimal alignment of the two trees in question. Results: Our algorithm is significantly faster than those methods based on distance matrix comparison: 1 min on a single processor versus 730 h on a supercomputer. Furthermore, we outperform the current state-of-the-art exhaustive search approach in terms of precision, while incurring acceptable losses in recall. Availability: A C implementation of the method demonstrated in this article is available at http://compbio.cs.sfu.ca/mirrort.htm Contact: [email protected]; [email protected]; [email protected]
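
    The underlying mirror-tree objective, agreement between the leaf-to-leaf distance matrices induced by the two trees under a candidate correspondence of leaves, can be sketched as follows. This is only an illustration of the scoring idea on toy trees; it is not the paper's score-optimal alignment algorithm, and all names and edge lengths are made up.

        from itertools import combinations

        def leaf_distances(tree, leaves):
            """All-pairs path lengths between leaves of a tree given as an adjacency
            map {node: [(neighbor, edge_length), ...]}."""
            dist = {}
            for src in leaves:
                seen = {src: 0.0}
                stack = [src]
                while stack:
                    node = stack.pop()
                    for nbr, w in tree[node]:
                        if nbr not in seen:
                            seen[nbr] = seen[node] + w
                            stack.append(nbr)
                for dst in leaves:
                    dist[(src, dst)] = seen[dst]
            return dist

        def mapping_score(dist_a, dist_b, mapping):
            """Score a candidate leaf-to-leaf correspondence by how well the two
            distance matrices agree under it (lower is better)."""
            pairs = list(mapping.items())
            score = 0.0
            for (a1, b1), (a2, b2) in combinations(pairs, 2):
                score += (dist_a[(a1, a2)] - dist_b[(b1, b2)]) ** 2
            return score

        # Two toy trees with leaves {A, B, C} and {x, y, z}.
        tree_a = {"r": [("A", 1.0), ("u", 0.5)], "u": [("r", 0.5), ("B", 1.0), ("C", 2.0)],
                  "A": [("r", 1.0)], "B": [("u", 1.0)], "C": [("u", 2.0)]}
        tree_b = {"s": [("x", 1.1), ("v", 0.4)], "v": [("s", 0.4), ("y", 0.9), ("z", 2.1)],
                  "x": [("s", 1.1)], "y": [("v", 0.9)], "z": [("v", 2.1)]}
        da = leaf_distances(tree_a, ["A", "B", "C"])
        db = leaf_distances(tree_b, ["x", "y", "z"])
        print(mapping_score(da, db, {"A": "x", "B": "y", "C": "z"}))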

    Efficient algorithms for finding submasses in weighted strings

    We study the Submass Finding Problem: given a string s over a weighted alphabet, i.e., an alphabet S with a weight function ÎŒ : S → N, decide for an input mass M whether s has a substring whose weights sum up to M. If M is indeed a submass, then we want to find one or all occurrences of such substrings. We present efficient algorithms for both the decision and the search problem. Furthermore, our approach allows us to compute efficiently the number of different submasses of s. The main idea of our algorithms is to define appropriate polynomials such that we can determine the solution for the Submass Finding Problem from the coefficients of the product of these polynomials. We obtain very efficient running times by using the Fast Fourier Transform to compute this product. Our main algorithm for the decision problem runs in time O(ÎŒ(s) log ÎŒ(s)), where ÎŒ(s) is the total mass of string s. Employing standard methods for compressing sparse polynomials, this runtime can be viewed as O(σ(s) logÂČ σ(s)), where σ(s) denotes the number of different submasses of s. In this case, the runtime is independent of the size of the individual masses of characters.
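
    A minimal sketch of the polynomial-product idea: encode the prefix masses of s as exponents of two polynomials and read off, from one FFT-based product, how many substrings realize each mass. Function and variable names are illustrative, and this is not the paper's exact implementation.

        import numpy as np

        def submass_counts(s, mu):
            """For each possible submass M, count the substrings of s with mass M,
            via one polynomial product computed with the FFT. `mu` maps characters
            to positive integer weights. Sketch only, not the paper's implementation."""
            prefix = [0]
            for ch in s:
                prefix.append(prefix[-1] + mu[ch])
            total = prefix[-1]
            # A(x) = sum over j >= 1 of x^{prefix[j]}; B(x) = sum over i >= 0 of x^{total - prefix[i]}.
            # The coefficient of x^{total + M} in A*B counts pairs with prefix[j] - prefix[i] = M.
            A = np.zeros(total + 1)
            B = np.zeros(total + 1)
            for p in prefix[1:]:
                A[p] += 1
            for p in prefix[:-1]:
                B[total - p] += 1
            size = 1
            while size < 2 * (total + 1):
                size *= 2
            prod = np.fft.irfft(np.fft.rfft(A, size) * np.fft.rfft(B, size), size)
            counts = np.rint(prod[total + 1: 2 * total + 1]).astype(int)
            return {M: int(c) for M, c in enumerate(counts, start=1) if c > 0}

        weights = {"a": 1, "b": 2, "c": 3}
        print(submass_counts("abcab", weights))   # e.g. mass 3 is realized by "ab", "c", "ab"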

    Measuring the Difficulty of Distance-Based Indexing

    Data structures for similarity search are commonly evaluated on data in vector spaces, but distance-based data structures are also applicable to non-vector spaces with no natural concept of dimensionality. The intrinsic dimensionality statistic of Chávez and Navarro provides a way to compare the performance of similarity indexing and search algorithms across different spaces, and to predict the performance of index data structures on non-vector spaces by relating them to equivalent vector spaces. We characterise its asymptotic behaviour, and give experimental results to calibrate these comparisons.
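
    The statistic in question is ρ = ΌÂČ/(2σÂČ), where ÎŒ and σÂČ are the mean and variance of the pairwise distance distribution, and it can be estimated for any metric space by sampling distances. The sketch below (sampling scheme and example space chosen arbitrarily) is an estimator of that statistic, not the paper's experimental setup.

        import random
        import statistics

        def intrinsic_dimensionality(points, distance, samples=10000, rng=random):
            """Estimate the Chavez-Navarro intrinsic dimensionality rho = mu^2 / (2 sigma^2)
            from sampled pairwise distances, where mu and sigma^2 are the mean and variance
            of the distance distribution. Works for any metric-like `distance` callable,
            not just vector spaces."""
            dists = []
            for _ in range(samples):
                a, b = rng.sample(points, 2)
                dists.append(distance(a, b))
            mu = statistics.fmean(dists)
            var = statistics.pvariance(dists, mu)
            return mu * mu / (2 * var)

        # Example on a non-vector space: fixed-length binary strings under Hamming distance.
        def hamming(a, b):
            return sum(x != y for x, y in zip(a, b))

        strings = ["".join(random.choice("01") for _ in range(32)) for _ in range(200)]
        print(intrinsic_dimensionality(strings, hamming, samples=5000))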

    The degree distribution of the generalized duplication model

    We study and generalize the duplication model of Pastor-Satorras et al. [Evolving protein interaction networks through gene duplication, J. Theor. Biol. 222 (2003) 199–210]. This model generates a graph by iteratively "duplicating" a randomly chosen node as follows: we start at time t_0 with a fixed graph G(t_0) of size t_0. At each step t > t_0 a new node v_t is added. The node v_t selects an existing node u from V(G(t-1)) = {v_1, 
, v_{t-1}} uniformly at random (uar). The node v_t then connects to each neighbor of the node u in G(t-1) independently with probability p. Additionally, v_t connects uar to every node of V(G(t-1)) independently with probability r/t, and parallel edges are merged. Unlike other copy-based models, the degree of the node v_t in this model is not fixed in advance; rather, it depends strongly on the degree of the original node u it selected. Our main contributions are as follows: we show that (1) the duplication model of Pastor-Satorras et al. does not generate a truncated power-law degree distribution as stated in Pastor-Satorras et al. [Evolving protein interaction networks through gene duplication, J. Theor. Biol. 222 (2003) 199–210]; (2) the special case where r = 0 does not give a power-law degree distribution as stated in Chung et al. [Duplication models for biological networks, J. Comput. Biol. 10 (2003) 677–687]; (3) we generalize the Pastor-Satorras et al. duplication process to ensure (if required) that the minimum degree of all vertices is positive. We prove that this generalized model has a power-law degree distribution.
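
    A direct simulation makes the generative process concrete. The sketch below (seed graph and the parameters p and r chosen arbitrarily) grows a graph under the duplication-plus-uniform-attachment rule described above and tabulates the empirical degree distribution; it illustrates the process only and reproduces none of the paper's analysis.

        import random
        from collections import Counter

        def generalized_duplication(seed_graph, t_final, p, r, rng=random):
            """Simulate the duplication model described above: at each step t a new node
            v_t copies each edge of a uniformly chosen existing node u with probability p,
            and additionally links to every existing node with probability r/t. Parallel
            edges are merged because neighbors are kept in a set."""
            adj = {v: set(nbrs) for v, nbrs in seed_graph.items()}
            for t in range(len(adj) + 1, t_final + 1):
                existing = list(adj)
                u = rng.choice(existing)                 # node to duplicate
                new = set()
                for w in adj[u]:                         # copy each edge with probability p
                    if rng.random() < p:
                        new.add(w)
                for w in existing:                       # uniform extra edges with probability r/t
                    if rng.random() < r / t:
                        new.add(w)
                adj[t] = new
                for w in new:
                    adj[w].add(t)
            return adj

        # Start from a small seed graph (a triangle) and grow to 5000 nodes.
        seed = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
        graph = generalized_duplication(seed, 5000, p=0.3, r=1.0)
        degree_counts = Counter(len(nbrs) for nbrs in graph.values())
        print(sorted(degree_counts.items())[:10])        # low end of the empirical degree distribution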